# Linear regression

A simple machine learning model that can uncover relationships in data.

Linear regression is a widely used machine learning algorithm for modelling and analyzing data.

It is a simple and effective technique for quantifying relationships between variables and predicting numeric outcomes. The basic premise of linear regression is to find the linear function of the independent variables (the features) that best fits the dependent variable (the target), usually by minimizing the sum of squared differences between predicted and observed values. The fitted coefficients show how the prediction changes with each feature, which helps identify trends and correlations in the data and supports predictions on new observations.
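
As a minimal sketch of what "best linear relationship" means, the example below fits a straight line by ordinary least squares using NumPy. The synthetic data, the true slope of 3 and intercept of 2, and all variable names are assumptions made for illustration; they are not part of the housing dataset used later.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 3x + 2 plus small noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 2.0 + rng.normal(0, 0.5, size=200)

# Design matrix with a column of ones so the intercept is estimated too.
X = np.column_stack([x, np.ones_like(x)])

# Ordinary least squares: choose the slope and intercept that minimise
# the sum of squared residuals between X @ params and y.
(slope, intercept), *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

The recovered slope and intercept should land close to the values used to generate the data, which is the sense in which the fitted line is the "best" one.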

Linear regression is a versatile tool with applications in various fields, from finance and economics to healthcare and engineering.


## Loading the data

In [1]:
import pandas as pd
df = pd.read_csv("data/housing.csv")
df.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


## Preparing training data

In [2]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [41]:
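# Three numeric features predict median_house_value; half the rows are held out,
# stratified so both halves keep the same mix of ocean_proximity categories.
x_train, x_test, y_train, y_test = train_test_split(
    df[["housing_median_age", "total_rooms", "median_income"]], df.median_house_value,
    test_size=0.5, stratify=df.ocean_proximity)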

In [42]:
df.shape

(20640, 10)

In [43]:
x_train.shape

(10320, 3)

In [44]:
x_test.shape

(10320, 3)
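
The split above shuffles the rows differently on every run, so the scores reported later will vary slightly between runs. A minimal sketch of a reproducible split follows; it passes `random_state` to `train_test_split`. The stand-in arrays, the `groups` labels, and the seed value 42 are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 3 features, and a two-class label to stratify on.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)
groups = np.array(["INLAND", "NEAR BAY"] * 50)

# random_state fixes the shuffle, so the same rows land in each split on every run.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=groups, random_state=42
)
print(X_train.shape, X_test.shape)
```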

## Building the model

In [45]:
model = LinearRegression()

In [46]:
model.fit(x_train, y_train)

LinearRegression()

In [47]:
model.score(x_test, y_test)

0.504466886613274
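
For a regressor such as `LinearRegression`, `score` returns the coefficient of determination R², so the value above means the three features explain roughly half of the variance in house values on the held-out data. The sketch below recomputes R² by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares, and compares it with `model.score`. The synthetic data and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic regression problem (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=200)

# Fit on the first half, evaluate on the second half.
model = LinearRegression().fit(X[:100], y[:100])
X_test, y_test = X[100:], y[100:]

# R^2 = 1 - SS_res / SS_tot
pred = model.predict(X_test)
ss_res = np.sum((y_test - pred) ** 2)           # unexplained variation
ss_tot = np.sum((y_test - y_test.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(X_test, y_test))
```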

## Improving the model

In [48]:
from sklearn import preprocessing

In [49]:
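# Split the held-out half again: 75% for validation, 25% for the final test (the default test_size is 0.25).
x_val, x_test, y_val, y_test = train_test_split(x_test, y_test)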

In [50]:
x_test.shape

(2580, 3)

In [51]:
scaler = preprocessing.StandardScaler()
model = LinearRegression()

In [52]:
scaler.fit(x_train)

StandardScaler()

In [53]:
x_scaled = scaler.transform(x_train)
x_scaled

array([[-1.48316536, -0.99365153, -0.87440976],
       [ 0.43085565, -0.39327003,  0.80370426],
       [ 1.86637141, -0.63119073, -0.92829028],
       ...,
       [ 1.86637141, -0.92223068, -0.98650237],
       [-0.0476496 , -0.29015619, -0.58562075],
       [ 0.5903574 , -0.44058635, -1.13308907]])

In [54]:
model.fit(x_scaled, y_train)

LinearRegression()

In [55]:
model.score(scaler.transform(x_val), y_val)

0.5118435778695601

In [56]:
# Same steps as above, with min-max scaling instead of standardization.
scaler = preprocessing.MinMaxScaler().fit(x_train)
model = LinearRegression().fit(scaler.transform(x_train), y_train)
model.score(scaler.transform(x_val), y_val)

0.51184357786956
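
Both scalers give the same validation score, which is expected: ordinary least squares with an intercept produces the same predictions under any per-feature shift and rescaling, because the coefficients and intercept adjust to compensate. The difference from the earlier 0.504 most likely comes from the evaluation set (the full 10,320-row held-out half versus the 7,740-row validation split), not from scaling. The sketch below illustrates the invariance on synthetic data; the data, feature scales, and names are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic features on deliberately different scales (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3)) * np.array([1.0, 100.0, 0.1])
y = X @ np.array([2.0, 0.03, 15.0]) + rng.normal(size=300)
X_train, X_val, y_train, y_val = X[:200], X[200:], y[:200], y[200:]

scores = {}
for name, scaler in [("none", None), ("standard", StandardScaler()), ("minmax", MinMaxScaler())]:
    if scaler is None:
        Xt, Xv = X_train, X_val
    else:
        # Fit the scaler on the training data only, then apply it to both sets.
        Xt = scaler.fit_transform(X_train)
        Xv = scaler.transform(X_val)
    scores[name] = LinearRegression().fit(Xt, y_train).score(Xv, y_val)

print(scores)
```

Scaling still matters for models that are sensitive to feature magnitude, such as regularized regression or distance-based methods, and it makes coefficients easier to compare across features.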

## Predicting with the model

In [39]:
model.predict(scaler.transform(x_test))

array([307579.1482936 , 195257.7614732 , 210653.12200599, ...,
       134249.034224  , 147248.80763583, 347003.6324425 ])

In [38]:
y_test

8629     470000.0
6090     173300.0
18972    191300.0
5979     240700.0
8751     366900.0
           ...   
15627    500001.0
2761      81300.0
14886    146300.0
15177    210100.0
17047    500001.0
Name: median_house_value, Length: 2580, dtype: float64
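
Comparing the two printouts by eye is hard, so it can help to summarize the gap between predictions and actual values in a single number. The sketch below computes the mean absolute error and the root mean squared error with `sklearn.metrics`. The four value pairs are hypothetical and are not taken from the outputs above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical actual and predicted house values, for illustration only.
y_true = np.array([200000.0, 150000.0, 300000.0, 250000.0])
y_pred = np.array([210000.0, 140000.0, 330000.0, 250000.0])

# Mean absolute error: the average size of the miss, in the target's units.
mae = mean_absolute_error(y_true, y_pred)

# Root mean squared error: penalises large misses more heavily than MAE.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

print(f"MAE:  {mae:.0f}")
print(f"RMSE: {rmse:.0f}")
```

With the notebook's variables, the same calls would take `y_test` and `model.predict(scaler.transform(x_test))` as arguments.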

## Inspecting the model

In [57]:
model.coef_

array([102231.51493819, 124088.28743787, 619644.91567053])

In [58]:
model.intercept_

-3233.8078487273597
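
A linear model's prediction is the intercept plus the dot product of the features with the coefficients. Assuming the cells ran in the order shown, the coefficients above belong to the model trained on min-max scaled features, so each one is the change in predicted value as that feature moves from its training minimum to its training maximum. The sketch below checks the prediction formula on synthetic data; the data and names are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with known weights and offset (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 3))
y = X @ np.array([4.0, -1.0, 2.5]) + 7.0 + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X, y)

# Rebuild the predictions by hand: intercept + features . coefficients
manual = model.intercept_ + X @ model.coef_

print(model.coef_, model.intercept_)
print(np.allclose(manual, model.predict(X)))
```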

## Exercise

Experiment with how different preprocessing steps affect your data and the resulting model score.

## Additional Resources

- [Model Selection](https://scikit-learn.org/stable/model_selection.html)
- [Scikit-Learn Train Test Split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
- [Scikit-Learn Linear Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)